\(~\)

Day 2 Session 3: Data Visualization

\(~\)

This week, we will introduce how to use visualization to observe patterns in our data. There are two major sets of tools for creating plots in R: base graphics, which come with every R installation, and ggplot2, a stand-alone package.

\(~\)

We will focus on ggplot2 in this class.

\(~\)

Research methods classes generally teach important skills such as probability and statistical theory, linear regression, maximum likelihood estimation (MLE), machine learning, etc. While these are important methods for analyzing data and assessing research questions, drawing a picture (a.k.a. visualization) should often be the first step, and it can reveal patterns that conventional statistical computations miss.

\(~\)

Okay, let’s get started!

\(~\)

1. The Dataset

\(~\)

For the following examples, we will be using the gapminder dataset. Gapminder is a country-year dataset with information on life expectancy, among other things.

\(~\)

If you have not already installed the gapminder package and you try to load it using the following code, you will get an error:

\(~\)

library(gapminder)
Error in library(gapminder) : there is no package called ‘gapminder’

\(~\)

If this happens, install the gapminder package by running install.packages("gapminder") in your console.
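
The install step can also be made conditional, so the same script runs whether or not the package is already present. This is a minimal sketch, assuming a CRAN mirror is configured for your R session; `requireNamespace()` is a base R function that checks whether a package is installed without attaching it.

```r
# Install gapminder only if it is not already available, then load it.
# Assumption: install.packages() can reach a CRAN mirror from this session.
if (!requireNamespace("gapminder", quietly = TRUE)) {
  install.packages("gapminder")
}
library(gapminder)
```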

\(~\)

Once you’ve done this, run the following code to load the tidyverse library, which includes ggplot2, and the gapminder dataset:

\(~\)

library(tidyverse)
library(gapminder)
## Warning: package 'gapminder' was built under R version 4.0.2
gap <- gapminder 
head(gap)
## # A tibble: 6 × 6
##   country     continent  year lifeExp      pop gdpPercap
##   <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
## 1 Afghanistan Asia       1952    28.8  8425333      779.
## 2 Afghanistan Asia       1957    30.3  9240934      821.
## 3 Afghanistan Asia       1962    32.0 10267083      853.
## 4 Afghanistan Asia       1967    34.0 11537966      836.
## 5 Afghanistan Asia       1972    36.1 13079460      740.
## 6 Afghanistan Asia       1977    38.4 14880372      786.

\(~\)

Challenge

\(~\)

Once you load the data, based on what we’ve learned in previous classes, discuss the following questions within your group.

    1. How many countries and continents are there in the data?
    2. What is the time range?
    3. What is the mean life expectancy? How does it change from the first year in the data to the last year?
    4. Show the GDP per capita of the United States over the years.

\(~\)

(Hint: You can also run ?gapminder in the console to open the help file for the data and definitions for each of the columns.)

\(~\)

2. ggplot2 Grammar

\(~\)

The general call for ggplot2 looks like this:

\(~\)

ggplot(data =, aes(x = , y = )) + 
  geom_xxxx() + 
  geom_yyyy()

\(~\)

The grammar involves some basic components:

    1. Data: a data.frame
    2. Aesthetics: How your data are represented visually, aka its “mapping”: which variables are shown on the x and y axes, as well as color, size, shape, etc.
    3. Geometry: The geometric objects in a plot – histograms, points, lines, smooth lines, etc.

\(~\)

The key to understanding ggplot2 is thinking about a figure in layers, just like you might do in an image editing program like Photoshop.

\(~\)

ggplot(data = gap, aes(x = gdpPercap, y = lifeExp)) +
  geom_point()

\(~\)

So the first thing we do is call the ggplot function. This function lets R know that we’re creating a new plot, and any of the arguments we give the ggplot function are the global options for the plot: they apply to all layers on the plot.
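
To see the layering idea in action, the plot can be built in two steps and stored in variables. This is a small sketch; the names `base_plot` and `scatter` are illustrative and not part of the original example.

```r
library(ggplot2)
library(gapminder)

gap <- gapminder

# Step 1: ggplot() alone sets the data and the global mapping.
# Printing base_plot shows labelled axes but no data, because no geom has been added yet.
base_plot <- ggplot(data = gap, aes(x = gdpPercap, y = lifeExp))
base_plot

# Step 2: adding a geom layer tells ggplot how to draw the data.
scatter <- base_plot + geom_point()
scatter
```

Because the global options live in `base_plot`, any layer added to it inherits the same data and x/y mapping.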

\(~\)

For the second argument we passed in the aes function, which tells ggplot how variables in the data map to aesthetic properties of the figure, in this case the x and y locations. Here we told ggplot we want to plot the gdpPercap column of the gap data frame on the x-axis, and the lifeExp column on the y-axis.

\(~\)

Notice that we didn’t need to explicitly pass aes these columns (e.g., x = gap$gdpPercap). This is because ggplot is smart enough to look in the data for that column!

\(~\)

Then, we need to tell ggplot how we want to visually represent the data, which we do by adding a new geom layer. In our example, we used geom_point, which tells ggplot we want to visually represent the relationship between x and y as a scatterplot of points:

\(~\)

IMPORTANT: In ggplot, you are adding layers, so you should use + to separate each line of code!

\(~\)

ggplot(data = gap, aes(x = gdpPercap, y = lifeExp)) +
  geom_point()

\(~\)

Challenge

\(~\)

  1. Modify the example so that the figure visualises how life expectancy has changed over time:

\(~\)

3. Anatomy of aes

\(~\)

In the previous examples and challenge we’ve used the aes function to tell the scatterplot geom about the x and y locations of each point. Another aesthetic property we can modify is the point color.

ggplot(data = gap, aes(x = gdpPercap, y = lifeExp, color = continent)) + 
  geom_point()

Then, we can add a line of code to set the colors manually. You can also search online for the R color palette to find the names and codes of other colors.

ggplot(data = gap, aes(x = gdpPercap, y = lifeExp, color = continent)) + 
  geom_point() +
  scale_color_manual(values = c("gold", "lightblue", "red", "lightgreen", "pink"))
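
An unnamed vector assigns colors in the order of the factor levels of continent. As a hedged variation, the vector can be named so that each color is tied to a continent explicitly; the object name `my_continent_colors` is illustrative.

```r
library(ggplot2)
library(gapminder)

gap <- gapminder

# Naming each color ties it to a continent explicitly,
# rather than relying on the order of the factor levels.
my_continent_colors <- c(
  Africa   = "gold",
  Americas = "lightblue",
  Asia     = "red",
  Europe   = "lightgreen",
  Oceania  = "pink"
)

p_manual <- ggplot(data = gap, aes(x = gdpPercap, y = lifeExp, color = continent)) +
  geom_point() +
  scale_color_manual(values = my_continent_colors)
p_manual
```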

Furthermore, you can modify the opacity of the points with the alpha argument of geom_point. alpha ranges from 0 (fully transparent) to 1 (fully opaque).

ggplot(data = gap, aes(x = gdpPercap, y = lifeExp, color = continent)) + 
  geom_point(alpha = 0.5)

Color isn’t the only aesthetic we can use to display variation in the data. We can also vary shape, size, etc. For example, we can map shape to continent as well.

ggplot(data = gap, aes(x = gdpPercap, y = lifeExp, color = continent, shape = continent)) + 
  geom_point(alpha = 0.5)
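
Size can be mapped in the same way. The sketch below assumes we want point size to reflect population (the pop column); that choice is only for illustration.

```r
library(ggplot2)
library(gapminder)

gap <- gapminder

# Map point size to population, in addition to color by continent.
p_size <- ggplot(data = gap, aes(x = gdpPercap, y = lifeExp, color = continent, size = pop)) +
  geom_point(alpha = 0.5)
p_size
```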

\(~\)

4. Layers

\(~\)

In the previous challenge, you plotted lifeExp over time. A scatterplot probably isn’t the best way to visualise change over time. Instead, let’s tell ggplot to visualise the data as a line plot:

ggplot(data = gap, aes(x = year, y = lifeExp, group = country, color = continent)) + 
  geom_line()

Instead of adding a geom_point layer, we’ve added a geom_line layer. We’ve also added the group aesthetic, which tells ggplot to draw a line for each country.

\(~\)

But what if we want to visualize both lines and points on the plot? We can simply add another layer to the plot:

ggplot(data = gap, aes(x = year, y = lifeExp, group = country, color = continent)) + 
  geom_line() + 
  geom_point()

It’s important to note that each layer is drawn on top of the previous layer. In this example, the points have been drawn on top of the lines. Here’s another demonstration:

ggplot(data = gap, aes(x = year, y = lifeExp, group = country)) + 
  geom_line(aes(color = continent)) + 
  geom_point()

In this example, the aesthetic mapping of color has been moved from the global plot options in ggplot to the geom_line layer so it no longer applies to the points. Now we can clearly see that the points are drawn on top of the lines.

\(~\)

Challenge

\(~\)

  1. Switch the order of the point and line layers from the previous example. What happened?

\(~\)

5. Labels and Themes

\(~\)

Labels are considered to be their own layers in ggplot. You can use labs(x = , y = , title = ) to set your labels.

# add x and y axis labels
ggplot(data = gap, aes(x = gdpPercap, y = lifeExp, color=continent)) + 
  geom_point(alpha = 0.5) + 
  labs(x = "GDP per capita (in US$)", y = "Life Expectancy (in years)", 
       title = "Relations of Life Expectancy and Economic Development, by Continent")

You can also modify the theme of your plots. The themes in ggplot include theme_bw(), theme_classic(), theme_light(), theme_void(), etc.

# add labels and a theme
ggplot(data = gap, aes(x = gdpPercap, y = lifeExp, color = continent)) + 
  geom_point(alpha = 0.5) + 
  labs(x = "GDP per capita (in US$)", y = "Life Expectancy (in years)", 
       title = "Relations of Life Expectancy and Economic Development, by Continent") +
  theme_bw()
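
When comparing themes, it can help to store the plot once and add a different theme each time. This is a small sketch; the name `base_plot` is illustrative.

```r
library(ggplot2)
library(gapminder)

gap <- gapminder

# Build the plot once, without a theme.
base_plot <- ggplot(data = gap, aes(x = gdpPercap, y = lifeExp, color = continent)) +
  geom_point(alpha = 0.5) +
  labs(x = "GDP per capita (in US$)", y = "Life Expectancy (in years)")

# Add one theme at a time to compare them.
base_plot + theme_classic()
base_plot + theme_light()
base_plot + theme_void()
```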

\(~\)

Challenge

\(~\)

  1. Try different themes in ggplot, and discuss in your group which one you prefer.

\(~\)

6. Transformations and Statistics

\(~\)

In ggplot, we can change the scale of units on the x-axis using the scale functions. These control the mapping between the data values and visual values of an aesthetic.

ggplot(data = gap, aes(x = gdpPercap, y = lifeExp, color = continent)) + 
  geom_point(alpha = 0.5) + 
  scale_x_log10() + # plot the x-axis on a log10 scale
  labs(x = "Logged GDP per capita (in US$)", y = "Life Expectancy (in years)", 
       title = "Relations of Life Expectancy and Economic Development, by Continent")

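We can also transform the variable ourselves inside the global aes() call. Note that log() is the natural log, so the x-axis now shows the logged values rather than the original units. For example,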

# Here I take the natural log transformation on GDP per capita
ggplot(data = gap, aes(x = log(gdpPercap), y = lifeExp, color = continent)) + 
  geom_point(alpha = 0.5) + 
  labs(x = "Logged GDP per capita (in US$)", y = "Life Expectancy (in years)", 
       title = "Relations of Life Expectancy and Economic Development, by Continent")
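
The two approaches are close but not identical: scale_x_log10() uses base 10 and keeps the axis labels in the original units, while log() inside aes() takes the natural log and labels the axis with the logged values. The sketch below compares the two functions on a few illustrative values (`gdp_values` is made up for the demonstration) and shows how to take a base-10 log inside aes() instead.

```r
library(ggplot2)
library(gapminder)

gap <- gapminder

# Illustrative values, not taken from the data.
gdp_values <- c(100, 1000, 10000)
log10(gdp_values)  # base-10 logs: 2, 3, 4
log(gdp_values)    # natural logs of the same values (a different scale)

# Transforming inside aes() with log10() gives a base-10 axis in logged units.
p_log10 <- ggplot(data = gap, aes(x = log10(gdpPercap), y = lifeExp)) +
  geom_point(alpha = 0.5)
p_log10
```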

\(~\)

ggplot also provides several useful statistical tools. One of the most useful is geom_smooth, which draws a smoothed line through the data; with method = "lm" it draws a linear regression line. We can fit a simple relationship to the data by adding it as another layer:

ggplot(data = gap, aes(x = log(gdpPercap), y = lifeExp, color = continent)) + 
  geom_point(alpha = 0.5) + 
  geom_smooth(method = "lm") +
  labs(x = "Logged GDP per capita (in US$)", y = "Life Expectancy (in years)", 
       title = "Relations of Life Expectancy and Economic Development, by Continent")
## `geom_smooth()` using formula = 'y ~ x'

Note that we have 5 lines, one for each continent, because the color mapping is in the global aes() call. But if we move it, we get different results:

ggplot(data = gap, aes(x = log(gdpPercap), y = lifeExp)) + 
  geom_point(aes(color = continent), alpha = 0.5) + 
  geom_smooth(method = "lm") +
  labs(x = "Logged GDP per capita (in US$)", y = "Life Expectancy (in years)", 
       title = "Relations of Life Expectancy and Economic Development, by Continent")

So, there are two places where an aesthetic mapping can be specified. Here, we mapped color inside aes() in the geom_point layer, so it applies only to the points. Previously, we mapped it inside aes() in the global ggplot() call, so it applied to every layer, including geom_smooth.
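
A related distinction is between mapping an aesthetic to a variable inside aes() and setting it to a fixed value outside aes(). The sketch below contrasts the two; the color "steelblue" and the object names are illustrative choices.

```r
library(ggplot2)
library(gapminder)

gap <- gapminder

# Mapping: color varies with the data, so it goes inside aes().
p_mapped <- ggplot(data = gap, aes(x = log(gdpPercap), y = lifeExp)) +
  geom_point(aes(color = continent), alpha = 0.5)

# Setting: every point gets the same fixed color, so it goes outside aes().
p_fixed <- ggplot(data = gap, aes(x = log(gdpPercap), y = lifeExp)) +
  geom_point(color = "steelblue", alpha = 0.5)

p_mapped
p_fixed
```

Only the mapped version produces a legend, because only there does color carry information about the data.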

We can make the line thicker and change its color by setting linewidth and color in the geom_smooth layer. (Since ggplot2 3.4.0, line thickness is set with linewidth; using size for lines is deprecated.)

ggplot(data = gap, aes(x = log(gdpPercap), y = lifeExp)) + 
  geom_point(aes(color = continent), alpha = 0.5) + 
  geom_smooth(method = "lm", linewidth = 2, color = "red") +
  labs(x = "Logged GDP per capita (in US$)", y = "Life Expectancy (in years)", 
       title = "Relations of Life Expectancy and Economic Development, by Continent")

\(~\)
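
The single line that geom_smooth(method = "lm") draws when color is mapped only in geom_point corresponds to one linear regression fitted to all observations. As a rough check, the same model can be fitted directly with lm(); the name `pooled_fit` is illustrative.

```r
library(gapminder)

gap <- gapminder

# Pooled linear regression of life expectancy on logged GDP per capita,
# ignoring continent (one line for all observations).
pooled_fit <- lm(lifeExp ~ log(gdpPercap), data = gap)
coef(pooled_fit)  # intercept and slope of the fitted line
```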

Lastly, you can use the dplyr functions that we learned last week to choose the data we want to plot. For example, suppose we only care about the data on Asia and the Americas, before and after 1990.

# 1990 and earlier
gap %>%
  filter(continent == "Americas" | continent == "Asia") %>%
  filter(year <= 1990) %>%
  ggplot(aes(x = log(gdpPercap), y = lifeExp, color = continent)) + 
  geom_point(alpha = 0.5) + 
  geom_smooth(method = "lm") +
  labs(x = "Logged GDP per capita (in US$)", y = "Life Expectancy (in years)", 
       title = "Relations of Life Expectancy and Economic Development, Before 1990")

# after 1990
gap %>%
  filter(continent == "Americas" | continent == "Asia") %>%
  filter(year > 1990) %>%
  ggplot(aes(x = log(gdpPercap), y = lifeExp, color = continent)) + 
  geom_point(alpha = 0.5) + 
  geom_smooth(method = "lm") +
  labs(x = "Logged GDP per capita (in US$)", y = "Life Expectancy (in years)", 
       title = "Relations of Life Expectancy and Economic Development, After 1990")

Pay attention here: when we use dplyr and pipes, we use %>% to connect lines; however, in ggplot, we use + instead!
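
One way to keep the two operators apart is to finish the dplyr step first, store the result, and only then start the plot. The sketch below does this for the earlier period; `asia_americas` is an illustrative name, and `%in%` is used as a compact alternative to chaining `==` with `|`.

```r
library(dplyr)
library(ggplot2)
library(gapminder)

gap <- gapminder

# dplyr step: pipes (%>%) connect the data-wrangling verbs.
asia_americas <- gap %>%
  filter(continent %in% c("Americas", "Asia"), year <= 1990)

# ggplot step: plus signs (+) connect the plot layers.
ggplot(data = asia_americas, aes(x = log(gdpPercap), y = lifeExp, color = continent)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm")
```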

\(~\)

Challenge

  1. Can you replicate these figures?

(Hint: replace color with shape, and shape label values are here.)

\(~\)

library(tidyverse)
library(gapminder)

gap <- gapminder 

\(~\)

7. Facets

Previously, we visualized the change in life expectancy over time across all countries in one plot. Alternatively, we can split this out over multiple panels by adding a layer of facet panels.

\(~\)

facet_wrap() is a useful tool to display patterns for different groups. For example:

ggplot(data = gap, aes(x = year, y = lifeExp)) +
  geom_point() + 
  facet_wrap(~ continent)

\(~\)

If we would like to compare the five continents in a single row, we can use ncol = or nrow = to set the number of columns or rows of panels.

\(~\)

ggplot(data = gap, aes(x = year, y = lifeExp)) +
  geom_point() + 
  facet_wrap(~ continent, ncol = 5)
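
Facets combine with the other layers covered earlier. As a sketch, the per-country line plot can be split into one panel per continent; using `nrow = 1` here is an assumption made for illustration and gives the same single-row layout as `ncol = 5`.

```r
library(ggplot2)
library(gapminder)

gap <- gapminder

# One line per country, one panel per continent, all panels in a single row.
p_facets <- ggplot(data = gap, aes(x = year, y = lifeExp, group = country)) +
  geom_line(alpha = 0.4) +
  facet_wrap(~ continent, nrow = 1)
p_facets
```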

\(~\)

8. Scales and Legends

\(~\)

8.1 Scales

Scales control the mapping from data to aesthetics. They take your data and turn it into something that you can see, like size, color, position or shape. Scales also provide the tools that let you read the plot: the axes and legends. You can generate many plots without knowing how scales work, but understanding scales and learning how to manipulate them will give you much more control.

\(~\)

Take the life expectancy over the years as an example:

ggplot(data = gap, aes(x = year, y = lifeExp)) +
  geom_point() +
  scale_y_continuous(limits = c(20, 100))

\(~\)

We can set the scale for the y-axis by adding a scale_y_continuous() layer, since lifeExp is a continuous variable. We can modify its limits with limits = and choose which values to show with breaks =.

\(~\)

ggplot(data = gap, aes(x = year, y = lifeExp)) +
  geom_point() +
  scale_y_continuous(limits = c(20, 100), breaks = c(20, 30, 40, 50, 60, 70, 80, 90, 100))
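
Evenly spaced breaks do not have to be typed out by hand. A minimal sketch using base R's seq(), with the illustrative name `y_breaks`:

```r
library(ggplot2)
library(gapminder)

gap <- gapminder

# Generate breaks from 20 to 100 in steps of 10.
y_breaks <- seq(20, 100, by = 10)

p_breaks <- ggplot(data = gap, aes(x = year, y = lifeExp)) +
  geom_point() +
  scale_y_continuous(limits = c(20, 100), breaks = y_breaks)
p_breaks
```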

\(~\)

We can also assign different labels to the values, by the labels argument.

\(~\)

ggplot(data = gap, aes(x = year, y = lifeExp)) +
  geom_point() +
  scale_y_continuous(limits = c(20, 100), breaks = c(30, 60, 90), 
                     labels = c("low (30)", "medium (60)", "high (90)"))

8.2 Legends

Legends are more complicated than axes because:

\(~\)

    1. A legend can display multiple aesthetics (e.g., color and shape), from multiple layers, and the symbol displayed in a legend varies based on the geom used in the layer.
    2. Axes always appear in the same place. Legends can appear in different places, so you need some global way of controlling them.
    3. Legends have considerably more details that can be tweaked: should they be displayed vertically or horizontally? How many columns? How big should the keys be?

\(~\)

The following sections describe the options that control these interactions.

ggplot(data = gap, aes(x = year, y = lifeExp, color = continent)) +
  geom_point() +
  theme_bw()

\(~\)

By default, a layer will only appear in the legend if the corresponding aesthetic is mapped to a variable with aes(). You can override this with show.legend: FALSE prevents a layer from ever appearing in the legend; TRUE forces it to appear when it otherwise wouldn’t.

\(~\)

ggplot(data = gap, aes(x = year, y = lifeExp, color = continent)) +
  geom_point(show.legend = FALSE) +
  theme_bw()

\(~\)

You can also change the location of the legend with the theme() function. The position and justification of legends are controlled by the theme setting legend.position, which takes values “right”, “left”, “top”, “bottom”, or “none” (no legend).

\(~\)

ggplot(data = gap, aes(x = year, y = lifeExp, color = continent)) +
  geom_point() +
  theme_bw() +
  theme(legend.position = "bottom")

\(~\)

Alternatively, if there’s a lot of blank space in your plot you might want to place the legend inside the plot. You can do this by setting legend.position to a numeric vector of length two. The numbers represent a relative location in the panel area: c(0, 1) is the top-left corner and c(1, 0) is the bottom-right corner. You control which corner of the legend the legend.position refers to with legend.justification, which is specified in a similar way. Unfortunately positioning the legend exactly where you want it requires a lot of trial and error.

\(~\)

ggplot(data = gap, aes(x = year, y = lifeExp, color = continent)) +
  geom_point() +
  scale_y_continuous(limits = c(0, 100)) +
  theme_bw() +
  theme(legend.position = c(1, 0), legend.justification = c(1, 0))

\(~\)

\(~\)

\(~\)

\(~\)

Acknowledgements

This page is, in part, derived from the following sources:

  1. R for Data Science licensed under Creative Commons Attribution-NonCommercial-NoDerivs 3.0.

  2. RStudio Support.

  3. Rochelle Terman’s class notes for PLSC 31101: Computational Tools for Social Science.